
How many utterances per recording?

What is the rhythm of activities like over each transcript?

Sequence plots: Show the sequence of coded contexts over time for each transcript.

In each of the following plots, the activity contexts are shown on the y-axis, and the course of the transcript runs along the x-axis, from the first utterance to the last. Darkening in the colored bar for a context indicates that the context is coded as happening at that utterance. Utterance number serves as a rough proxy for time of day, since time stamps are not available in the corpus; although this is a noisy measure with obvious shortcomings, it preserves the sequential ordering of events even if it loses information about precise timing.

Several general methodological points of interest are visible in these plots.

When 'activity context' is defined using the word list approach, there can be (and often are) sections of the corpus that receive no context tag at all (vertical sections of the plot with no darkened context bars). This is less common with the other approaches to defining context, giving the word list plots a relatively sparse appearance.

In the coder judgment plots, some vertical bars have no data at all; those represent the few parts of the corpus that are not fully coded (that is, they do not have 5 independent coders for each utterance).

To determine context using topic modeling, I binned the transcripts into documents of 30 utterances each (necessary to provide large enough samples of speech to estimate the word co-occurrence rates on which topic modeling algorithms are based). The topic modeling approaches therefore tag the corpus in bins of 30 utterances, resulting in chunks of each context that are at least 30 utterances long. The word list method identifies chunks of at least 5 utterances because of the smoothing procedure, which picks up the 2 utterances before and after each tagged utterance. The coder judgments are likewise based on 30-utterance sections of the corpus, but because each utterance is often coded by more than the required 5 coders, taking a random sample of 5 coders for each utterance produces a more natural gradient at the edges of context chunks.
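The word-list smoothing procedure can be sketched as a simple dilation over the sequence of tagged utterances. This is an illustrative Python sketch, not the study's actual code; the function name and data layout are assumptions.

```python
# Hedged sketch of the word-list smoothing described above: each utterance
# tagged by a key word also tags the 2 utterances before and after it, so
# every tagged chunk spans at least 5 utterances. Names are illustrative.

def smooth_tags(tagged, window=2):
    """Dilate a boolean tag sequence by `window` utterances on each side."""
    n = len(tagged)
    smoothed = [False] * n
    for i, hit in enumerate(tagged):
        if hit:
            # Tag a window of utterances around each key-word hit,
            # clipped to the transcript boundaries.
            for j in range(max(0, i - window), min(n, i + window + 1)):
                smoothed[j] = True
    return smoothed

# A single tagged utterance becomes a 5-utterance chunk:
print(smooth_tags([False, False, False, True, False, False, False]))
# → [False, True, True, True, True, True, False]
```

Overlapping windows from nearby hits merge into one longer chunk, which is why 5 utterances is a minimum rather than a fixed chunk length.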

Note that contexts with fewer than 100 total utterances across the entire corpus (less than 1% of the corpus) are not shown. This omits 1 context using the word list definition of context and 13 contexts using the coder judgments.

Example 1: la16

The following are four sequence plots from one transcript (child la at 16 weeks, the longest transcript in the corpus).

Defining context using key words from the word lists:

Defining context using coder judgments:

Defining context using topics from LDA topic modeling:

Defining context using topics from STM topic modeling (which allows for family-to-family variability when discovering topics):

It is also possible to see hints of agreement across methods by examining the plots. For example, there is a section around utterance 1300 that appears to be identified as “bath time” by human coders, includes words from the “bath” word list, and is mostly assigned topic 8 by the STM topic model (that episode is not as clearly picked out by the LDA topic model, although perhaps topics 10 and/or 1 correspond). Note that the LDA topic model assigns mostly one context for the duration of the transcript (topic 11), possibly because it gets caught on family-specific words like the child’s name.

Example 2: gl06

The following are four sequence plots from one transcript (child gl at 6 weeks). There appear to be a couple of naps during this transcript: one around utterance 150 and another beginning around utterance 400 (and possibly a third right at the beginning of the transcript, identified by the word list and coder judgment methods only). There also appears to be a bath around utterance 550 according to the word list and coder judgment plots, although it is not marked with topic 8 in the STM plot, unlike in the previous example. Again, the LDA topic model relies heavily on one topic throughout the transcript (topic 7).

Defining context using key words from the word lists:

Defining context using coder judgments:

Defining context using topics from LDA topic modeling:

Defining context using topics from STM topic modeling (which allows for family-to-family variability when discovering topics):

How are the contexts distributed across families and transcripts?

These plots show how many utterances fall in each context for each transcript (each child at each age), for each of the approaches to defining context. For each approach, there is one plot showing raw number of utterances in each transcript and a second plot presenting the same information in terms of proportion of total context codes for each transcript.
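The counts and proportions behind these plots can be sketched with a small groupby. This is an illustrative Python/pandas sketch under assumed column names (one row per context code on an utterance), not the study's actual analysis code.

```python
# Hedged sketch of the per-transcript summaries described above, using a
# hypothetical table with one row per (transcript, context) code assigned
# to an utterance. Column names and values are illustrative stand-ins.
import pandas as pd

codes = pd.DataFrame({
    "transcript": ["la16", "la16", "la16", "gl06", "gl06"],
    "context":    ["bath", "bath", "nap",  "nap",  "play"],
})

# Raw number of coded utterances for each context in each transcript
counts = codes.groupby(["transcript", "context"]).size().rename("n").reset_index()

# The same information as a proportion of each transcript's total context codes
counts["prop"] = counts["n"] / counts.groupby("transcript")["n"].transform("sum")
print(counts)
```

The raw-count plot and the proportion plot described above correspond to the `n` and `prop` columns, respectively.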

One of the most important things to note in these plots is which contexts are distributed more or less evenly across families and ages, and which contexts appear to be family- or age-specific. There is a particularly striking difference between the two topic modeling methods: STM shows a rather even distribution of contexts, whereas LDA has a strong tendency to pick one or two dominant topics for each family. When contexts are defined according to LDA topic modeling, most of the transcripts are dominated by one context (and that same context often repeats across recordings for that family). Since these are day-long, natural recordings of infants at home with their caregivers, it is surprising to see a single context characterizing an entire transcript; the fact that this homogeneity appears only with the LDA method suggests it may be an artifact of that analysis procedure.

This difference may arise because STM (unlike LDA) allows the prevalence and characteristics of each topic to vary from family to family during estimation. Because LDA lacks this flexibility, it may get caught up on common words that are family-specific and, as a consequence, miss patterns of words that vary within families. The most obvious example of family-specific words is the children’s names, but several other words, for one reason or another, appear often in only one family and not the others.

Defining context using key words from the word lists:

Defining context using coder judgments:

Defining context using topics from LDA topic modeling:

Defining context using topics from STM topic modeling:

Note that both the STM and LDA topic models were run after removing stop words. Many stop words occur at very high frequency in samples of natural speech, so they may show up as highly probable under a given topic not because of their specificity to that topic but because of their generally high probability of occurring regardless of topic.
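The preprocessing step can be sketched as a simple filter over tokenized utterances. This is an illustrative Python sketch; the stop list shown here is a tiny stand-in, not the list actually used in the study.

```python
# Hedged sketch of stop-word removal before topic modeling. The stop list
# is a tiny illustrative stand-in for a full English stop-word list.
STOP_WORDS = {"the", "a", "to", "you", "it", "is", "that"}

def remove_stop_words(utterance):
    """Drop stop words from an utterance before building topic-model documents."""
    return [w for w in utterance.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("is that the teddy you want"))
# → ['teddy', 'want']
```

Without this step, high-frequency function words would tend to dominate the top-words lists for every topic, obscuring the context-specific vocabulary the models are meant to pick out.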

Top words for each topic, in order:

```
topic_1 topic_2 topic_3 topic_4 topic_5 topic_6 topic_7 topic_8 topic_9 topic_10 topic_11 topic_12
oh hey go yes hello yes oh ya oh oh mm oh
hey oh hey oh hmm oh come go go yes mummi yes
dear mummi want well oh tell dear come come ah yes tickl
hannah yes can boo dear go yes oh yes dear eh dear
alright hannah hmm hello got hey littl ya one girl go
trea hello come go gillian come good got eh clean hello come
ssh look nice say matter dear got yes bath two come got
yum hold see come smile look darl look alright wee oh big
darl smile smile dear look stori girl like get dirti good see
like shh littl clever hey daddi mum want like now alright good
stretch girl look christoph want got windi bit want pet go thumb
thank hmm like hey littl can mummi make mum dri hey hey
chang chou play ya go get nappi see good hair big oop
cri love hold know mummi nois hey eh arm wash know fat
nois bubbl now boy face funni stori nice splash away darl one
now dear take good like teddi just just back get nice toe
pie lambchop hand hand ah make put shall now got girli bad
beebo nice oh can big smile minut get big hide hannah finger
dub blow well wrong funni nose chang one kick nice better back
girl littl kick look girl round pet big gonna anoth daddi get
```